On finding cross-lingual article pairs

Author

  • Dirk Ahlers
Abstract

Finding a Wikipedia article in another language is often possible through the built-in interlanguage links. We explore whether these links can be generated automatically for geotagged articles, as an application of entity resolution at the article level. This has the potential to improve Wikipedia, and it also allows a well-curated ground truth to be used for the merging algorithm. The resolution is based only on the simple features of coordinates and title, metadata that can be taken from APIs without parsing the full article itself. We use a conflation approach to identify articles with mismatched coordinates and a translation matrix tailored to the titles. Even complicated cases such as cities, municipalities, or departments with similar names at the same coordinates can mostly be identified correctly. Honduras was chosen as a test region because the country has limited coverage (754 articles across the Spanish and English versions at the time of writing [2]), which allows for a full manual assessment of results, and because the resulting data is a basis for a geospatial search engine [1]. This finding has not been published in such brevity before, appropriate to the selection of features.

Body

Cross-lingual merging of Honduran geotagged Wikipedia articles based on article names and locations alone results in 99.4% correct pairs.

References

[1] D. Ahlers. Towards Geospatial Search for Honduras. In Proceedings of the Latinamerican Conference on Networked and Electronic Media LACNEM 2011, San José, Costa Rica, 2011. Universidad Latina Costa Rica.
[2] D. Ahlers. Of 754 Wikipedia articles geotagged in Honduras, 345 are from the Spanish version, 409 are in English. 15.Jun.12, 7:52pm. Tweet. https://twitter.com/dirkahlers/statuses/213690505630990339.

Volume 1 of Tiny Transactions on Computer Science

This content is released under the Creative Commons Attribution-NonCommercial ShareAlike License. Permission to make digital or hard copies of all or part of this work is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. CC BY-NC-SA 3.0: http://creativecommons.org/licenses/by-nc-sa/3.0/.
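The matching idea described in the abstract (coordinates plus title, with some mapping between Spanish and English title terms) can be illustrated with a small sketch. This is not the author's implementation: the distance threshold, the similarity threshold, the use of `difflib` for title comparison, the tiny `TERM_MAP` standing in for the "translation matrix", and the sample coordinates are all assumptions made for illustration.

```python
import math
import unicodedata
from difflib import SequenceMatcher

# Hypothetical term map standing in for the paper's "translation matrix";
# the abstract does not list the actual entries.
TERM_MAP = {"departamento": "department", "municipio": "municipality", "de": "", "of": ""}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def normalize(title):
    """Strip accents, lowercase, map known terms, and sort the remaining tokens."""
    decomposed = unicodedata.normalize("NFKD", title)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c)).lower()
    tokens = plain.replace(",", " ").replace("(", " ").replace(")", " ").split()
    mapped = [TERM_MAP.get(t, t) for t in tokens]
    return " ".join(sorted(t for t in mapped if t))


def match_articles(es_articles, en_articles, max_km=5.0, min_sim=0.8):
    """Pair each Spanish article with the most title-similar nearby English article."""
    pairs = []
    for es in es_articles:
        best, best_score = None, min_sim
        for en in en_articles:
            # Coordinate filter: skip candidates that are too far away (assumed threshold).
            if haversine_km(es["lat"], es["lon"], en["lat"], en["lon"]) > max_km:
                continue
            score = SequenceMatcher(None, normalize(es["title"]), normalize(en["title"])).ratio()
            if score >= best_score:
                best, best_score = en, score
        if best is not None:
            pairs.append((es["title"], best["title"]))
    return pairs


# Illustrative data: two different entities placed at the same (made-up) coordinates.
es_sample = [
    {"title": "Departamento de Copán", "lat": 14.84, "lon": -89.15},
    {"title": "Copán Ruinas", "lat": 14.84, "lon": -89.15},
]
en_sample = [
    {"title": "Copán Ruinas", "lat": 14.84, "lon": -89.15},
    {"title": "Copán Department", "lat": 14.84, "lon": -89.15},
]
pairs = match_articles(es_sample, en_sample)
print(pairs)
```

In this toy setup the coordinates alone cannot separate the department from the town, so the title comparison after term mapping does the disambiguation, which mirrors the "similar names at the same coordinates" case mentioned in the abstract.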

Similar resources

Cross-Lingual Infobox Alignment in Wikipedia Using Entity-Attribute Factor Graph

Wikipedia infoboxes contain information about article entities in the form of attribute-value pairs, and are thus a very rich source of structured knowledge. However, as the different language versions of Wikipedia evolve independently, it is a promising but challenging problem to find correspondences between infobox attributes in different language editions. In this paper, we propose 8 effecti...

Finding Translation Examples for Under-Resourced Language Pairs or for Narrow Domains; the Case for Machine Translation

The cyberspace is populated with valuable information sources, expressed in about 1500 different languages and dialects. Yet, for the vast majority of WEB surfers this wealth of information is practically inaccessible or meaningless. Recent advancements in cross-lingual information retrieval, multilingual summarization, cross-lingual question answering and machine translation promise to narrow ...

Cross-lingual Models of Word Embeddings: An Empirical Comparison

Despite interest in using cross-lingual knowledge to learn word embeddings for various tasks, a systematic comparison of the possible approaches is lacking in the literature. We perform an extensive evaluation of four popular approaches of inducing cross-lingual embeddings, each requiring a different form of supervision, on four typologically different language pairs. Our evaluation setup spans...

Limitations of Cross-Lingual Learning from Image Search

Cross-lingual representation learning is an important step in making NLP scale to all the world’s languages. Recent work on bilingual lexicon induction suggests that it is possible to learn cross-lingual representations of words based on similarities between images associated with these words. However, that work focused on the translation of selected nouns only. In our work, we investigate whet...

SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation

Semantic Textual Similarity (STS) seeks to measure the degree of semantic equivalence between two snippets of text. Similarity is expressed on an ordinal scale that spans from semantic equivalence to complete unrelatedness. Intermediate values capture specifically defined levels of partial similarity. While prior evaluations constrained themselves to just monolingual snippets of text, the 2016 ...

Journal title:
  • TinyToCS

Volume 1

Publication date 2012